Abstract: We consider an $\ell_1$-penalization procedure in the non-parametric
Gaussian regression model. In many concrete examples, the dimension $d$ of
the input variable $X$ is very large (sometimes depending on the number
of observations). Estimation of a $\beta$-regular regression function $f$ cannot
be faster than the slow rate $n^{-2\beta/(2\beta+d)}$. Fortunately, in some situations,
$f$ depends only on a few of the coordinates of $X$. In this paper,
we construct two procedures. The first one selects, with high probability,
these coordinates. Then, using this subset selection method, we run a local
polynomial estimator (on the set of interesting coordinates) to estimate the
regression function at the rate $n^{-2\beta/(2\beta+d^*)}$, where $d^*$, the "real" dimension
of the problem (the exact number of variables on which $f$ depends), has
replaced the dimension $d$ of the design. To achieve this result, we use an $\ell_1$
penalization method in this non-parametric setup.
1. Introduction

We consider the non-parametric Gaussian regression model

$$Y_i = f(X_i) + e_i, \quad i = 1, \ldots, n,$$

where the design variables (or input variables) $X_1, \ldots, X_n$ are $n$ i.i.d. random
variables with values in $\mathbb{R}^d$, the noise $e_1, \ldots, e_n$ are $n$ i.i.d. Gaussian random
variables with variance $\sigma^2$ independent of the $X_i$'s and $f$ is the unknown regression
function. In this paper, we are interested in the pointwise estimation of $f$
at a fixed point $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$. We want to construct some estimation
procedures $\hat f_n$ having the smallest pointwise integrated quadratic risk

$$\mathbb{E}\big(\hat f_n(x) - f(x)\big)^2 \qquad (1)$$

using only the set of data $D_n = (Y_i, X_i)_{1 \le i \le n}$.

Assuming that the regression function enjoys some regularity properties around
$x$ is a classical assumption for this problem. In this work, we assume $f$ to be
$\beta$-Hölderian around $x$. We recall that a function $f : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-Hölderian at
the point $x$ with $\beta > 0$, denoted by $f \in \Sigma(\beta, x)$, when the two following points
hold:

• $f$ is $l$-times differentiable in $x$ (where $l = \lfloor\beta\rfloor$ is the largest integer which
is strictly smaller than $\beta$),

• there exists $L > 0$ such that for any $t = (t_1, \ldots, t_d) \in B_\infty(x, 1)$,

$$|f(t) - P_l(f)(t, x)| \le L\|t - x\|_1^{\beta},$$

where $P_l(f)(\cdot, x)$ is the Taylor polynomial of order $l$ associated with $f$ at
the point $x$, $\|\cdot\|_1$ is the $\ell_1$ norm and $B_\infty(x, 1)$ is the unit $\ell_\infty$-ball of center
$x$ and radius 1.

When $f$ is only assumed to be in $\Sigma(\beta, x)$, no estimator can converge to $f$ (for
the risk given in equation (1)) faster than

$$n^{-2\beta/(2\beta+d)}. \qquad (2)$$

This rate can be very slow when the dimension d of the input variable X is large. 
In many practical problems, the dimension d can depend on the number n of 
observations in such a way that the rate (2) does not even tend to zero when n 
tends to infinity. This phenomenon is usually called the curse of dimensionality. 
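
As a purely numerical illustration of this phenomenon (a sketch, not part of the original argument), the short script below evaluates the rate $n^{-2\beta/(2\beta+d)}$ for a fixed large dimension, for a small dimension, and for a dimension growing like $\log n$. The function names and the chosen values of $n$, $\beta$ and $d$ are assumptions made for the example only. When $d = \log n$ the rate equals $\exp(-2\beta\log n/(2\beta+\log n))$, which stays above $e^{-2\beta}$ and therefore does not tend to zero.

```python
import math


def minimax_rate(n: int, beta: float, dim: float) -> float:
    """Return n^(-2*beta / (2*beta + dim)), the rate of equation (2)."""
    return n ** (-2.0 * beta / (2.0 * beta + dim))


def rate_with_log_dimension(n: int, beta: float) -> float:
    """Rate (2) when the ambient dimension grows like d = log(n)."""
    return minimax_rate(n, beta, math.log(n))


if __name__ == "__main__":
    beta = 2.0  # illustrative smoothness
    for n in (10**3, 10**6, 10**9):
        print(
            n,
            minimax_rate(n, beta, 50),          # large ambient dimension d = 50
            minimax_rate(n, beta, 2),           # small "real" dimension d* = 2
            rate_with_log_dimension(n, beta),   # d of the order of log n
        )
```

Comparing the second and third columns of the printed output shows how much is gained when $d$ can be replaced by a small $d^*$.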
Fortunately, in some of these problems the regression function really depends
only on a few coordinates of the input variables. We formulate this
heuristic by the following assumption:

Assumption 1. There exist an integer $d^* \le d$, a function $g : \mathbb{R}^{d^*} \to \mathbb{R}$ and
a subset $J = \{i_1, \ldots, i_{d^*}\} \subset \{1, \ldots, d\}$ of cardinality $d^*$ such that for any
$(x_1, \ldots, x_d) \in \mathbb{R}^d$

$$f(x_1, \ldots, x_d) = g(x_{i_1}, \ldots, x_{i_{d^*}}).$$

Under Assumption 1, the "real" dimension of the problem is no longer $d$
but $d^*$. Then, we hope that if $f \in \Sigma(\beta, x)$ (which is equivalent to saying that $g$
is $\beta$-Hölderian at the point $x$), it would be possible to estimate $f(x)$ at the rate
given in equation (2) where $d$ is replaced by $d^*$, leading to a real improvement
of the convergence rate when $d^* \ll d$. Nevertheless, starting from the data
$D_n$, it is not clear that detecting the set of interesting coordinates $J$ is an easy
task. To select this set, we use an $\ell_1$ penalization technique. This technique has
been mostly used in the parametric setup (cf. Bickel et al. (2008), Zhao and Yu
(2006), Meinshausen and Yu (2008) and references therein). In the present work,
we adapt it to the non-parametric setup and we obtain our first result in the
following theorem, which is a short version of Theorem 1.
Theorem A (selection of the subset $J$). Under Assumption 1 it is possible
to construct, only from the data $D_n$, a subset $\hat J \subset \{1, \ldots, d\}$ such that, with
probability greater than $1 - c_0\exp(c_0 d - c_1 n h^{d+2})$ (for a free parameter $h > 0$
taken small enough; the admissible range is made explicit in Theorem 1),

$$\hat J = J.$$

Once the set $\hat J$ is empirically determined with high probability, we then run a
classical local polynomial estimation procedure on the set of indices $\hat J$ to obtain
the following theorem, which is a short version of Theorem 2.

Theorem B (estimation of $f$). For any $f \in \Sigma(\beta, x)$, with $\beta > 1$, satisfying
Assumption 1, it is possible to construct, only from the data $D_n$, an estimation
procedure $\hat f_n$ such that

$$\mathbb{P}\big[|\hat f_n(x) - f(x)| \ge \delta\big] \le c\exp\big(-c\,\delta^2 n^{2\beta/(2\beta+d^*)}\big), \quad \forall \delta > 0,$$
where $c$ does not depend on $n$.

The last theorem proves that it is possible, only from the set of data, to 
reduce and to detect the "real" dimension of the problem under Assumption 1. 

The problem we consider in the paper is called a high-dimensional problem.
In the last years, many papers have studied these kinds of problems and
summarizing here the state of the art is not possible (we refer the reader to the
bibliography of Lafferty and Wasserman (2008)). We just mention some papers.
In Bickel and Li (2007); Levina and Bickel (2005); Belkin and Niyogi (2003);
Donoho and Grimes (2003), it is assumed that the design variable $X$ belongs
to a low dimensional smooth manifold of dimension $d^* < d$. All of these works
are based on heuristic techniques. In Lafferty and Wasserman (2008), the same
problem as the one considered here is handled. Their strategy is a greedy method
that incrementally searches through bandwidths in small steps. If the regression
function $f$ is in a Sobolev ball of order 2, their procedure is nearly optimal for the
pointwise estimation of $f$ at $x$. It achieves the convergence rate $n^{-4/(4+d^*)+\epsilon}$ for every
$\epsilon > 0$, when $d = O(\log n/\log\log n)$ and $d^* = O(1)$. Our procedure improves this
result. First, the optimal rate of convergence is achieved. Second, the regression
function does not have to be twice differentiable (actually we have the result for
any $\beta > 1$). Third, the dimension $d$ can be taken of the order of $\log n$.

The paper is organized as follows. In the coming section, we construct the
procedures announced in Theorems A and B. The exact versions of Theorems A
and B are gathered in Section 3. Their proofs are given in Section 4.

2. Selection and estimation procedures 

Our goal is twofold. First, we want to determine the set of indices $J = \{i_1, \ldots, i_{d^*}\}$.
Second, we want to construct an estimator of the value $f(x)$ that converges at
the rate $n^{-2\beta/(2\beta+d^*)}$ when $f \in \Sigma(\beta, x)$ for $\beta > 1$. To achieve the first goal, we
use an $\ell_1$ penalization of local polynomial estimators.

2.1. Selection procedure

We consider the following set of vectors

$$\bar\Theta(\lambda) = \operatorname*{Arg\,min}_{\theta \in \mathbb{R}^{d+1}} \Big[ \frac{1}{nh^d} \sum_{i=1}^{n} \Big( Y_i - U\Big(\frac{X_i - x}{h}\Big)\theta \Big)^2 K\Big(\frac{X_i - x}{h}\Big) + 2\lambda\|\theta\|_1 \Big], \qquad (3)$$

where $U(v) = (1, v_1, \ldots, v_d)$ for any $v = (v_1, \ldots, v_d)^t \in \mathbb{R}^d$, $\|\theta\|_1 = \sum_{j=0}^{d} |\theta_j|$ for
any $\theta = (\theta_0, \ldots, \theta_d)^t \in \mathbb{R}^{d+1}$, $h > 0$ is called the bandwidth, $\lambda > 0$ is called the
regularization parameter and $K : \mathbb{R}^d \to \mathbb{R}$ is called the kernel. We will explain
how to choose the parameters $h$ and $\lambda$ in what follows. In the following, we
denote $U_0(v) = 1$ and $U_i(v) = v_i$, for $i = 1, \ldots, d$, for any $v = (v_1, \ldots, v_d)^t \in \mathbb{R}^d$.
The kernel $K$ is taken such that the following set of assumptions holds:

Assumption 2. The kernel $K : \mathbb{R}^d \to \mathbb{R}$ is symmetric, supported in $B_\infty(0, 1)$,
the matrix $\big(\int_{\mathbb{R}^d} K(y)U_i(y)U_j(y)\,dy\big)_{i,j \in \{0,\ldots,d\}}$ is diagonal with positive coefficients
independent of $d$ in the diagonal and there exists a constant $M_K \ge 1$ independent
of $d$ which upper bounds the quantities $\max_{u \in \mathbb{R}^d} K(u)^2$,
$\max_{u \in \mathbb{R}^d} |K(u)|\,\|u\|_2^2$, $\int_{\mathbb{R}^d} K(y)^2(1 + \|y\|_2^2)\,dy$, $\int_{\mathbb{R}^d} |K(u)|^2\|u\|_1^4\,du$
and $\int_{\mathbb{R}^d} K(y)^2(U_i(y)U_j(y))^2\,dy$.

Note that, for example, the uniform kernel $K(u) = 2^{-d}\,\mathbf{1}_{\{u \in B_\infty(0,1)\}}$ satisfies
Assumption 2.

Any statistic $\bar\theta \in \bar\Theta(\lambda)$ is an $\ell_1$ penalized version of the classical local polynomial
estimator. Usually, for the estimation problem of $f(x)$, only the first
coordinate of $\bar\theta$ is used. Here, for the selection problem, we will use all the
coordinates except the first one. We denote by $\hat\theta$ the vector of $\mathbb{R}^d$ made of the $d$
last coordinates of $\bar\theta$.

We expect the vector $\hat\theta$ to be sparse (that is, with many zero coordinates) such
that the set of all the non-zero coordinates of $\hat\theta$, denoted by $\hat J$, will be the same
as the set $J$ of all the non-zero coordinates of $(\theta_1^*, \ldots, \theta_d^*)^t$ where $\theta_i^* = h\,\partial_i f(x)$,
for $i \in \{1, \ldots, d\}$, and $\partial_i f(x)$ stands for the $i$-th partial derivative of $f$ at the point $x$. We
remark that, under Assumption 1, the vector $(\theta_1^*, \ldots, \theta_d^*)^t$ is sparse.
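
The selection step can be sketched in a few lines of code. The sketch below is only illustrative: it assumes the uniform kernel, builds the localized quantities $\alpha_i U((X_i - x)/h)$ and $\alpha_i Y_i$ with $\alpha_i = (K((X_i - x)/h)/(nh^d))^{1/2}$, and minimizes the criterion (3) with a plain cyclic coordinate descent. The solver and all function names (`localized_design`, `lasso_cd`, `select_coordinates`) are hypothetical choices for this example and are not prescribed by the text; any minimizer of (3) could be used instead.

```python
import math


def uniform_kernel(u):
    """Uniform kernel 2^{-d} 1{u in B_inf(0, 1)} on R^d."""
    return 2.0 ** (-len(u)) if all(abs(c) <= 1.0 for c in u) else 0.0


def localized_design(X, Y, x, h, kernel=uniform_kernel):
    """Rows alpha_i * U((X_i - x)/h) and outputs alpha_i * Y_i,
    with alpha_i = sqrt(K((X_i - x)/h) / (n h^d))."""
    n, d = len(X), len(x)
    A, Z = [], []
    for Xi, Yi in zip(X, Y):
        u = [(a - b) / h for a, b in zip(Xi, x)]
        alpha = math.sqrt(kernel(u) / (n * h ** d))
        A.append([alpha] + [alpha * c for c in u])
        Z.append(alpha * Yi)
    return A, Z


def soft_threshold(t, lam):
    """Shrink t towards zero by lam (zero if |t| <= lam)."""
    return math.copysign(max(abs(t) - lam, 0.0), t)


def lasso_cd(A, Z, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||Z - A theta||_2^2 + 2 lam ||theta||_1."""
    n, p = len(A), len(A[0])
    theta = [0.0] * p
    resid = list(Z)  # current residual Z - A theta
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the residual that ignores coordinate j
            rho = sum(A[i][j] * resid[i] for i in range(n)) + col_sq[j] * theta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= A[i][j] * delta
                theta[j] = new
    return theta


def select_coordinates(X, Y, x, h, lam):
    """Labels j in {1, ..., d} whose penalized coefficient is non-zero."""
    A, Z = localized_design(X, Y, x, h)
    theta = lasso_cd(A, Z, lam)
    return {j for j in range(1, len(theta)) if theta[j] != 0.0}


if __name__ == "__main__":
    grid = [-0.5, 0.0, 0.5]
    X = [[a, b] for a in grid for b in grid]
    Y = [1.0 + 2.0 * a for a, b in X]   # depends on the first coordinate only
    print(select_coordinates(X, Y, [0.0, 0.0], h=1.0, lam=0.01))
```

In this toy noise-free example the regression function depends on the first coordinate only, and the selected set contains only the label 1. The first coefficient (the intercept) is penalized as well, as in (3), but it is not used for the selection.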

Note that the estimator $\bar\theta \in \bar\Theta(\lambda)$ may not be unique (depending on $d$ and $n$).
Hence, the subset selection method may provide different subsets $\hat J$ depending
on the choice of $\bar\theta$. Nevertheless, Theorem 2 holds for any subset $\hat J$, whatever
the vector $\bar\theta$ chosen in $\bar\Theta(\lambda)$.

We also consider another selection procedure, close to the previous one, which
requires fewer assumptions on the regression function. We just need to assume
that there exists $f_{\max} > 0$ such that $|f(x)| \le f_{\max}$. With the same notation, we
consider the following set of vectors

$$\bar\Theta_2(\lambda) = \operatorname*{Arg\,min}_{\theta \in \mathbb{R}^{d+1}} \Big[ \frac{1}{nh^d} \sum_{i=1}^{n} \Big( Y_i + f_{\max} + Ch - U\Big(\frac{X_i - x}{h}\Big)\theta \Big)^2 K\Big(\frac{X_i - x}{h}\Big) + 2\lambda\|\theta\|_1 \Big], \qquad (4)$$

where $C$ and $h$ will be given later. We just translate the outputs $Y_i$'s by $f_{\max} + Ch$.
This translation affects the estimator since the LASSO method is not a
linear procedure. We denote by $\hat J_2$ the subset selected by this procedure.

Remark 1. The $\ell_1$ penalization technique can be related to the problem of linear
aggregation (cf. Nemirovski (2000) and Tsybakov (2003)) in a sparse setup.
Indeed, $\ell_1$ penalization is known to provide sparse estimators if the underlying
object to estimate is sparse with respect to a given dictionary. Assumption 1 can
be interpreted in terms of sparsity of $f$ w.r.t. a certain dictionary. For that,
we consider the set $\mathcal{F} = \{f_0, f_1, \ldots, f_d\}$ of functions from $\mathbb{R}^d$ to $\mathbb{R}$ where $f_0$
is the constant function equal to 1 and $f_j(t) = (t_j - x_j)/h$ for any $j \in \{1, \ldots, d\}$
and $t = (t_1, \ldots, t_d) \in \mathbb{R}^d$. The set $\mathcal{F}$ is the dictionary, that is, the set within which we
are looking for the best sparse linear combination of elements approaching $f$
in a neighborhood of $x$. In this setup, the Taylor polynomial of order 1 at the point $x$,
denoted by $P_1(f)(\cdot, x)$, is a linear combination of the elements in the dictionary
$\mathcal{F}$. When $f$ is assumed to belong to $\Sigma(\beta, x)$, the polynomial $P_1(f)(\cdot, x)$ is a good
approximation of $f$ in a neighborhood of $x$. Moreover, under Assumption 1, this
linear combination is sparse w.r.t. the dictionary $\mathcal{F}$. Thus, we hope that, with
high probability, minimizing a localized version of the empirical $L_2$-risk penalized
by the $\ell_1$ norm over the set of all the linear combinations of elements in $\mathcal{F}$ will
detect the right locations of the interesting indices $i_1, \ldots, i_{d^*}$ (which correspond
to the non-zero coefficients of $P_1(f)(\cdot, x)$ in the dictionary $\mathcal{F}$). That is the main
idea behind the procedures introduced in this section since we have:

$$\bar\Theta(\lambda) = \operatorname*{Arg\,min}_{\theta \in \mathbb{R}^{d+1}} \Big[ \frac{1}{nh^d} \sum_{i=1}^{n} \Big( Y_i - \sum_{j=0}^{d} \theta_j f_j(X_i) \Big)^2 K\Big(\frac{X_i - x}{h}\Big) + 2\lambda\|\theta\|_1 \Big].$$

Of course, we can generalize this approach to other dictionaries (this will lead
to other sparsity and regularity properties of $f$) provided that the orthogonality
properties of $\mathcal{F}$ (cf. Proposition 1) still hold.


2.2. Estimation procedure

We now construct a classical local polynomial estimator (LPE) (cf.
Korostelev and Tsybakov (1993); Tsybakov (1986)) on the set of coordinates
$\hat J_2$ previously constructed.

We assume that the selection step is now done. We have at hand a subset
$\hat J_2 = \{\hat i_1, \ldots, \hat i_{d^*}\} \subset \{1, \ldots, d\}$ of cardinality $d^*$. For the second step, we consider
$\hat\gamma_x$, a polynomial on $\mathbb{R}^{d^*}$ of degree $l = \lfloor\beta\rfloor$ which minimizes

$$\sum_{i=1}^{n} \big( Y_i - \gamma_x(p(X_i - x)) \big)^2 K^*\Big( p\Big(\frac{X_i - x}{h^*}\Big) \Big),$$

where $h^* = n^{-1/(2\beta+d^*)}$, $p(v) = (v_{\hat i_1}, \ldots, v_{\hat i_{d^*}})^t$ for any $v = (v_1, \ldots, v_d)^t \in \mathbb{R}^d$
and $K^* : \mathbb{R}^{d^*} \to \mathbb{R}$ is a kernel function. The local polynomial estimator of $f$
at the point $x$ is $\hat\gamma_x(0)$ if $\hat\gamma_x$ is unique and 0 otherwise. We denote by $\hat f(x)$ the
projection onto $[-f_{\max}, f_{\max}]$ of the LPE of $f(x)$. Here, we do not use the other
coefficients of $\hat\gamma_x$ like we did in the selection step.
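
To make the second step concrete, here is a minimal sketch of the estimation procedure under simplifying assumptions that are not in the text: the local polynomial has degree 1 (whereas the procedure above uses degree $l = \lfloor\beta\rfloor$), $K^*$ is the uniform kernel on the selected coordinates, and the weighted least-squares problem is solved through its normal equations with a small Gaussian elimination. The names `lpe_bandwidth`, `solve_linear_system` and `local_linear_estimate` are hypothetical.

```python
def lpe_bandwidth(n, beta, d_star):
    """h* = n^(-1/(2*beta + d*)), the bandwidth of the estimation step."""
    return n ** (-1.0 / (2.0 * beta + d_star))


def solve_linear_system(M, b):
    """Gaussian elimination with partial pivoting; None if (numerically) singular."""
    p = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))
        if abs(aug[piv][c]) < 1e-12:
            return None
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, p):
            f = aug[r][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[r][k] -= f * aug[c][k]
    sol = [0.0] * p
    for c in reversed(range(p)):
        s = aug[c][p] - sum(aug[c][k] * sol[k] for k in range(c + 1, p))
        sol[c] = s / aug[c][c]
    return sol


def local_linear_estimate(X, Y, x, selected, h_star, f_max):
    """Degree-1 local polynomial fit on the selected coordinates (labels in 1..d),
    evaluated at x, set to 0 if the fit is not unique, then clipped to [-f_max, f_max]."""
    idx = sorted(j - 1 for j in selected)   # 0-based positions of the kept coordinates
    p = len(idx) + 1
    M = [[0.0] * p for _ in range(p)]
    b = [0.0] * p
    for Xi, Yi in zip(X, Y):
        v = [Xi[j] - x[j] for j in idx]
        if any(abs(c) > h_star for c in v):  # outside the support of the uniform K*
            continue
        u = [1.0] + v
        for r in range(p):
            b[r] += u[r] * Yi
            for c in range(p):
                M[r][c] += u[r] * u[c]
    coef = solve_linear_system(M, b)
    value = coef[0] if coef is not None else 0.0   # intercept = fitted value at x
    return max(-f_max, min(f_max, value))


if __name__ == "__main__":
    grid = [-0.5, 0.0, 0.5]
    X = [[a, b] for a in grid for b in grid]
    Y = [1.0 + 2.0 * a for a, b in X]
    print(local_linear_estimate(X, Y, [0.0, 0.0], {1}, h_star=1.0, f_max=10.0))
```

Only the intercept of the local fit is kept, which mirrors the remark that the other coefficients are not used in this step.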

For the estimation step, we use a result on the convergence of multivariate 
LPE from Audibert and Tsybakov (2007). We recall here the properties of the 
kernel required in Audibert and Tsybakov (2007) to obtain this result. 

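Assumption 3. The kernel $K^* : \mathbb{R}^{d^*} \to \mathbb{R}$ is such that: there exists $c > 0$
satisfying

$$K^*(u) \ge c\,\mathbf{1}_{\{\|u\|_2 \le c\}}, \ \forall u \in \mathbb{R}^{d^*}; \qquad \int_{\mathbb{R}^{d^*}} K^*(u)\,du = 1;$$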




3. Results 

In this section, we provide the main results of this work. To avoid any technical
complexity we will assume that the density function $\mu$ of the design $X$ satisfies
the following assumption:

Assumption 4. There exist some constants $\eta, \mu_m > 0$, $\mu_M \ge 1$ and $L_\mu > 0$
such that

• $B_\infty(x, \eta) \subset \mathrm{supp}(\mu)$ and $\mu_m \le \mu(y) \le \mu_M$ for almost every $y \in B_\infty(x, \eta)$,

• $\mu$ is $L_\mu$-Lipschitzian around $x$, that is, for any $t \in B_\infty(x, 1)$, $|\mu(x) - \mu(t)| \le L_\mu\|x - t\|_\infty$
(remark that the value $\mu(x)$ is the value of the continuous
version of $\mu$ around $x$).

The first result deals with the statistical properties of the selection procedure.
For this step, we require a weaker regularity assumption for the regression
function $f$. This assumption is satisfied for any $\beta$-Hölderian function in $x$ with
$\beta > 1$.

Assumption 5. There exists an absolute constant $L > 0$ such that the following
holds. The regression function $f$ is differentiable and

$$|f(t) - P_1(f)(t, x)| \le L\|t - x\|_1, \quad \forall t \in B_\infty(x, 1),$$

where $P_1(f)(\cdot, x)$ is the Taylor polynomial of degree 1 of $f$ at the point $x$.

To achieve an efficient selection of the interesting coordinates, we have to be 
able to distinguish the non-zero partial derivatives of / from the null partial 
derivatives. For that, we consider the following assumption: 

Assumption 6. There exists a constant $C \ge 72(\mu_M/\mu_m)LM_K\sqrt{d_0}$ such that
$|\partial_j f(x)| \ge C$ for any $j \in J$, where the set $J$ is given in Assumption 1 and $d_0$ is
an integer such that $d^* \le d_0$.
Theorem 1. There exist some constants $c_0 > 0$ and $c_1 > 0$ depending only
on $L_\mu$, $\mu_m$, $\mu_M$, $M_K$, $L$, $C$ and $\sigma$ for which the following holds. We assume
that the regression function $f$ satisfies the regularity Assumption 5, the sparsity
Assumption 1 such that the integer $d^*$ is smaller than a known integer $d_0$ and
the distinguishability Assumption 6. We assume that the density function $\mu$ of the
input variable $X$ satisfies Assumption 4.

We consider $\bar\theta = (\bar\theta_0, \ldots, \bar\theta_d) \in \bar\Theta(\lambda) \subset \mathbb{R}^{d+1}$ and $\bar\theta_2 = ((\bar\theta_2)_0, \ldots, (\bar\theta_2)_d) \in
\bar\Theta_2(\lambda) \subset \mathbb{R}^{d+1}$ where $\bar\Theta(\lambda)$ and $\bar\Theta_2(\lambda)$ are defined in equations (3) and (4) with
a kernel satisfying Assumption 2, a bandwidth and a regularization parameter
such that

$$0 < h < \frac{\mu_m}{32(d_0 + 1)L_\mu M_K} \wedge \eta \quad\text{and}\quad \lambda = 8\sqrt{3M_K}\,\mu_M L h. \qquad (5)$$

We denote by $\hat J$ the set $\{j \in \{1, \ldots, d\} : \bar\theta_j \ne 0\}$ and by $\hat J_2$ the set
$\{j \in \{1, \ldots, d\} : (\bar\theta_2)_j \ne 0\}$.

• If $|f(x)| > Ch$, where $C$ is defined in Assumption 6, or $f(x) = 0$, then
with probability greater than $1 - c_1\exp(c_1 d - c_0 n h^{d+2})$, $\hat J = J$.

• If $|f(x)| \le f_{\max}$, then with probability greater than $1 - c_1\exp(c_1 d - c_0 n h^{d+2})$,
$\hat J_2 = J$.

We remark that Theorem 1 still holds when we only assume that there exists
a subset $J \subset \{1, \ldots, d\}$ such that $\partial_j f(x) = 0$ for any $j \notin J$ instead of the more
global Assumption 1.

Theorem 2. We assume that the regression function $f$ belongs to the Hölder
class $\Sigma(\beta, x)$ with $\beta > 1$ and satisfies the sparsity Assumption 1 such that the
integer $d^*$ is smaller than a known integer $d_0$ and the distinguishability Assumption 6.
We assume that the density function $\mu$ of the input variable $X$ satisfies
Assumption 4 and $|f(x)| \le f_{\max}$. We assume that the dimension $d$ is such that
$d + 2 \le (\log n)/(-2\log h)$ ($h$ satisfies (5)).

We construct the set $\hat J_2$ of selected coordinates with a kernel, a bandwidth
and a regularization parameter as in Theorem 1. The LPE estimator $\hat f(x)$
constructed in subsection 2.2 on the subset $\hat J_2$ with a kernel $K^*$ satisfying
Assumption 3 satisfies

$$\forall \delta > 0, \quad \mathbb{P}\big[|\hat f(x) - f(x)| \ge \delta\big] \le c_1\exp\big(-c_2 n^{2\beta/(2\beta+d^*)}\delta^2\big),$$

where $c_1, c_2 > 0$ are constants independent of $n$, $d$, $d^*$.

Note that, by taking the expectation, we obtain $\mathbb{E}[(\hat f(x) - f(x))^2] \le c\,n^{-2\beta/(2\beta+d^*)}$.

Remark. The selection procedure is efficient provided that $c_1 n h^{d+2} - c_0 d$ tends
to infinity when $n$ tends to infinity. Namely, we need (with $0 < h < 1$)



(6) 



K. Bertin and G. Lecue/ Dimension reduction in non-parametric regression 1231 



It is interesting to note that, for $d$ of the order of $\log n$ (like in (6)), the rate
of convergence in (2) does not tend to zero. Therefore, in this case and without
any previous selection step, a classical LPE can fail to estimate $f(x)$.

A remarkable point of Theorem 1 is that the bandwidth $h$ does not have to
tend to 0 when $n$ tends to infinity. This particular behavior does not appear when
LPE are used for estimation and not for selection. This can be explained by the fact
that we do not need to control any bias term in the selection step. The restriction on
$h$ comes only from the fact that we need the dictionary $\mathcal{F}$ to be approximately
orthogonal.

Finally, once the set of interesting coordinates is selected, we can use it to run
other non-parametric methods to estimate the function $f$ with other pointwise
risks or integrated risks and under other smoothness assumptions on $f$. Note
that, by considering other orders of the $\ell_1$-penalized LPE in the selection step, it
is easy to find other properties of the function $f$. For instance, inflection points
or convexity of $f$ can be detected with a second order method for the selection
step.

4. Proofs 

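4.1. Proof of Theorem 1

First note that, considering only the observations $X_i$ in the neighborhood of $x$,
an estimator $\bar\theta = (\bar\theta_0, \ldots, \bar\theta_d) \in \bar\Theta(\lambda)$ defined in (3) can be viewed as a Lasso
estimator in the linear regression model

$$Z = A\theta^* + \varepsilon, \quad\text{where } \theta^* = (\theta_0^*, \ldots, \theta_d^*)^t = (f(x), h\partial_1 f(x), \ldots, h\partial_d f(x))^t \qquad (7)$$

and, for any $i = 1, \ldots, n$, $\Delta_i := \alpha_i f(X_i) - A_i\theta^*$ and $\alpha_i := (nh^d)^{-1/2}K^{1/2}\big((X_i - x)/h\big)$,
the output vector $Z$ of $\mathbb{R}^n$ has for coordinates $Z_i := \alpha_i Y_i$, $i = 1, \ldots, n$, the
lines of the design matrix $A \in M_{n,d+1}$ are $A_i := \alpha_i U\big((X_i - x)/h\big)$, $i = 1, \ldots, n$
($U$ is defined after equation (3)), and the noise vector $\varepsilon$ has $\varepsilon_i = \alpha_i e_i + \Delta_i$
for coordinates. We remark that the noise is not centered. The "localized" bias
term $\Delta := (\Delta_1, \ldots, \Delta_n)^t$ has been added to the noise. With this new notation,
we have

$$\bar\Theta(\lambda) = \operatorname*{Arg\,min}_{\theta \in \mathbb{R}^{d+1}} \|Z - A\theta\|_2^2 + 2\lambda\|\theta\|_1,$$

where $\bar\Theta(\lambda)$ has been introduced in equation (3) and, for any $z = (z_1, \ldots, z_n) \in \mathbb{R}^n$,
$\|z\|_2^2 = \sum_{i=1}^{n} z_i^2$.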

For the same reason, an estimator $\bar\theta_2 \in \bar\Theta_2(\lambda)$ defined in (4) can be viewed
as a Lasso estimator in the linear regression model

$$Z_2 = A\theta_2^* + \varepsilon, \quad\text{where } \theta_2^* = (f(x) + f_{\max} + Ch, h\partial_1 f(x), \ldots, h\partial_d f(x))^t$$

and $Z_2$ has for coordinates $(Z_2)_i = \alpha_i(Y_i + f_{\max} + Ch)$, $i = 1, \ldots, n$. Note that the
$\Delta_i$'s are not affected by this translation.

We start by studying $\bar\theta \in \bar\Theta(\lambda)$ when $|f(x)| > Ch$ and $\bar\theta_2 \in \bar\Theta_2(\lambda)$ when
$|f(x)| \le f_{\max}$. The study of $\bar\theta$ when $f(x) = 0$ will be discussed at the end. Note
that, in both the considered cases, we have $|\theta_0^*| > Ch$ and $|(\theta_2^*)_0| > Ch$. This fact
will be used in what follows. We first study $\bar\theta$ when $|f(x)| > Ch$. The study of
$\bar\theta_2$ when $|f(x)| \le f_{\max}$ is the same with the translated data $\tilde Y_i = Y_i + f_{\max} + Ch$
and $\tilde f = f + f_{\max} + Ch$. Note that $f$ and $\tilde f$ have the same partial derivatives,
thus $\theta^*$ and $\theta_2^*$ have the same last $d$ coordinates, which are the only ones of
interest for the selection step.

Proving Theorem 1 can be viewed as a problem of sign consistency of the
Lasso estimator $\hat\theta = (\bar\theta_1, \ldots, \bar\theta_d)$ (the vector made of the $d$ last coordinates of
$\bar\theta$). To solve this problem, we follow the lines of Zhao and Yu (2006). We remark
that we treat carefully the problem of uniqueness of the LASSO, contrary to
the work of Zhao and Yu (2006) where uniqueness of the LASSO estimator was
assumed.

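We first treat the problem of uniqueness of the LASSO. We introduce the
function

$$\phi(\theta) := \|Z - A\theta\|_2^2 + 2\lambda\|\theta\|_1, \quad \forall \theta \in \mathbb{R}^{d+1} \qquad (8)$$

and we say that $\theta \in \mathbb{R}^{d+1}$ satisfies the system (S) when

$$(S)\quad \begin{cases} (A_{\cdot j})^t(Z - A\theta) = \lambda\,\mathrm{sign}(\theta_j) & \text{if } \theta_j \ne 0, \\ |(A_{\cdot j})^t(Z - A\theta)| \le \lambda & \text{if } \theta_j = 0, \end{cases} \qquad \forall j \in \{0, \ldots, d\},$$

where, for any $j \in \{0, \ldots, d\}$, the vector $A_{\cdot j}$ is the $j$-th column of $A$. It is known
that $\theta \in \mathbb{R}^{d+1}$ belongs to $\bar\Theta(\lambda)$ if and only if $\theta$ satisfies the system (S).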
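
The system (S) is easy to check numerically for a candidate vector. The helper below is an illustrative sketch only: the function name and the tolerance `tol` are assumptions introduced for the example, and the matrix `A` and the vector `Z` are given as plain Python lists. It computes the correlations $(A_{\cdot j})^t(Z - A\theta)$ and tests the two cases of (S).

```python
def satisfies_system_S(A, Z, theta, lam, tol=1e-9):
    """Check the system (S) for theta, up to a numerical tolerance tol.

    A is a list of n rows of length p, Z a list of length n, theta a list of length p.
    """
    n, p = len(A), len(theta)
    resid = [Z[i] - sum(A[i][j] * theta[j] for j in range(p)) for i in range(n)]
    for j in range(p):
        corr = sum(A[i][j] * resid[i] for i in range(n))  # (A_{.j})^t (Z - A theta)
        if theta[j] != 0.0:
            sign = 1.0 if theta[j] > 0 else -1.0
            if abs(corr - lam * sign) > tol:
                return False
        elif abs(corr) > lam + tol:
            return False
    return True


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # Soft-thresholding of Z = (3, 0.5) at level 1 gives (2, 0), which solves (S).
    print(satisfies_system_S(identity, [3.0, 0.5], [2.0, 0.0], lam=1.0))
    # The unpenalized least-squares solution does not.
    print(satisfies_system_S(identity, [3.0, 0.5], [3.0, 0.5], lam=1.0))
```

Such a check can serve as a stopping criterion or as a sanity test for any numerical minimizer of (8).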

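Lemma 1. If $\bar\theta^{(1)} \in \mathbb{R}^{d+1}$ and $\bar\theta^{(2)} \in \mathbb{R}^{d+1}$ are two solutions of (S) then
$A\bar\theta^{(1)} = A\bar\theta^{(2)}$.

Proof of Lemma 1. We denote by $S(\theta)$ the set $\{j \in \{0, \ldots, d\} : \theta_j \ne 0\}$. Let $\theta$
be a solution of (S). For any $v \in \mathbb{R}^{d+1}$ we have

$$\phi(\theta + v) - \phi(\theta) = 2\lambda\sum_{j \in S(\theta)} \big( |\theta_j + v_j| - |\theta_j| - v_j\,\mathrm{sign}(\theta_j) \big) + 2\lambda\sum_{j \notin S(\theta)} \big( |v_j| - \eta_j v_j \big) + \|Av\|_2^2,$$

where $\eta_j = \lambda^{-1}(A_{\cdot j})^t(Z - A\theta)$. For any $j \in S(\theta)$, we have $|\theta_j + v_j| - |\theta_j| -
v_j\,\mathrm{sign}(\theta_j) \ge 0$ and, for any $j \notin S(\theta)$, we have $|\eta_j| \le 1$ so $|v_j| - \eta_j v_j \ge 0$. Hence,

$$\phi(\theta + v) - \phi(\theta) \ge \|Av\|_2^2.$$

We take $\theta = \bar\theta^{(1)}$ and $v \in \mathbb{R}^{d+1}$ such that $\bar\theta^{(2)} = \bar\theta^{(1)} + v$. The vectors $\bar\theta^{(2)}$ and $\bar\theta^{(1)}$ are both
solutions of (S), thus they are minimizers of $\phi$ and so $\phi(\bar\theta^{(2)}) = \phi(\bar\theta^{(1)})$. Therefore,
we have $\|Av\|_2^2 = 0$. □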

Next, we prove a result which deals with the identifiability of the model as
well as the uniqueness of the LASSO. We introduce the event

$$\Omega_{01} := \Big\{ \forall \theta \in \mathbb{R}^{d+1} : \tfrac12\sqrt{\mu_m/2}\,\|\theta\|_2 \le \|A\theta\|_2 \le \sqrt{6\mu_M}\,\|\theta\|_2 \Big\}.$$

Proposition 1. There exist two constants $c_0$ and $c_1$ depending only on $\mu_m$, $\mu_M$
and $M_K$ such that, under Assumption 2 and the first point of Assumption 4 with
$0 < h < \eta$, we have

$$\mathbb{P}[\Omega_{01}] \ge 1 - c_0\exp(c_0 d - c_1 n h^d).$$
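Proof of Proposition 1. Let $\theta \in \mathbb{R}^{d+1}$. We have

$$\|A\theta\|_2^2 = \frac1n\sum_{k=1}^{n} \langle Z_k, \theta\rangle^2 \quad\text{where } Z_k := \sqrt{n}\,\alpha_k U^t\Big(\frac{X_k - x}{h}\Big).$$

It is easy to see that $|\langle Z_k, \theta\rangle^2| \le (2M_K/h^d)\|\theta\|_2^2$ and $\mathbb{V}(\langle Z_k, \theta\rangle^2) \le
(M_K\mu_M/h^d)\|\theta\|_2^4$. Let $0 < \gamma < 1$ be a number that will be chosen wisely
later. Bernstein's inequality yields that, with probability greater than
$1 - 2\exp(-9nh^d\gamma^2\mu_M/(16M_K))$,

$$\big| \|A\theta\|_2^2 - \mathbb{E}\|A\theta\|_2^2 \big| \le (3/2)\gamma\mu_M\|\theta\|_2^2.$$

Moreover, we have $\mathbb{E}\|A\theta\|_2^2 = \int_{\mathbb{R}^d} K(t)(U(t)\theta)^2\mu(x + ht)\,dt$. To simplify the
proof we will suppose that $\big(\int_{\mathbb{R}^d} K(t)U_i(t)U_j(t)\,dt\big)_{0 \le i,j \le d} = I_{d+1}$ but the proof
still holds when this matrix is diagonal with positive coefficients independent
of $d$ as in Assumption 2. Then we obtain that $\mu_m\|\theta\|_2^2 \le \mathbb{E}\|A\theta\|_2^2 \le \mu_M\|\theta\|_2^2$.
Thus, with probability greater than $1 - 2\exp(-9nh^d\gamma^2\mu_M/(16M_K))$, we have

$$(\mu_m - (3/2)\gamma\mu_M)\|\theta\|_2^2 \le \|A\theta\|_2^2 \le (\mu_M + (3/2)\gamma\mu_M)\|\theta\|_2^2. \qquad (9)$$

To control the probability measure of $\Omega_{01}$, we need a uniform control over
$\theta \in \mathbb{R}^{d+1}$ of $\|A\theta\|_2$. For that we use a classical $\epsilon$-net argument. For the sake
of completeness, we recall here this argument. Let $\epsilon > 0$ be chosen wisely
later and $N_\epsilon$ be an $\epsilon$-net of $S^d$ (the unit sphere of $\mathbb{R}^{d+1}$) for the $\|\cdot\|_2$-norm.
Using a union bound and equation (9), with probability greater than
$1 - 2|N_\epsilon|\exp(-9nh^d\gamma^2\mu_M/(16M_K))$, we have

$$(\mu_m - (3/2)\gamma\mu_M) \le \|A\theta\|_2^2 \le (\mu_M + (3/2)\gamma\mu_M), \quad \forall \theta \in N_\epsilon. \qquad (10)$$

Now, we want to extend the last result to the whole sphere $S^d$. Let $\theta \in S^d$.
There exists $\theta_0 \in N_\epsilon$ such that $\|\theta - \theta_0\|_2 \le \epsilon$. If $\theta \ne \theta_0$, there exists $\theta_1 \in N_\epsilon$
which is $\epsilon$-close to $(\theta - \theta_0)/\|\theta - \theta_0\|_2$. Using this argument recursively, we obtain
that there exists a sequence $(\delta_j)_{j \ge 0}$ of non-negative numbers such that $\delta_0 = 1$
and $|\delta_j| \le \epsilon^j$, $\forall j \ge 1$, and a sequence $(\theta_j)_{j \ge 0}$ of elements in $N_\epsilon$ such that

$$\theta = \sum_{j=0}^{\infty} \delta_j\theta_j.$$

Thus, for any $\theta \in S^d$, we have

$$\|A\theta\|_2 = \Big\| A\Big(\sum_{j=0}^{\infty} \delta_j\theta_j\Big) \Big\|_2 \le \sum_{j=0}^{\infty} |\delta_j|\,\|A\theta_j\|_2 \le \frac{1}{1 - \epsilon}\max_{\theta \in N_\epsilon}\|A\theta\|_2 \qquad (11)$$

and

$$\|A\theta\|_2 \ge \|A\theta_0\|_2 - \sum_{j=1}^{\infty} |\delta_j|\,\|A\theta_j\|_2 \ge \min_{\theta \in N_\epsilon}\|A\theta\|_2 - \frac{\epsilon}{1 - \epsilon}\max_{\theta \in N_\epsilon}\|A\theta\|_2. \qquad (12)$$

We take $\gamma = \mu_m/(3\mu_M)$ and $\epsilon = (1/4)\sqrt{\mu_m/(3\mu_M)}$. We know that there
exists an absolute constant $c > 0$ such that $|N_\epsilon| \le (c/2)\epsilon^{-d}$. Using this fact
and equations (10), (11) and (12), with probability greater than $1 - c\exp(c_0 d -
c_1 n h^d)$, we have

$$\tfrac12\sqrt{\mu_m/2} \le \|A\theta\|_2 \le \sqrt{6\mu_M}, \quad \forall \theta \in S^d,$$

where $c_0 = \log\big[4\sqrt{3\mu_M/\mu_m}\big] \vee c$ and $c_1 = \mu_m^2/(16M_K\mu_M)$. We complete the
proof by applying this result to the vector $\theta/\|\theta\|_2$ for any $\theta \in \mathbb{R}^{d+1} - \{0\}$ (the
result is obvious for $\theta = 0$). □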

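We introduce the square matrix of $M_{d+1}$

$$\Psi := \Psi^{(n)} := A^tA = (\Psi_{ij})_{0 \le i,j \le d} \quad\text{and}\quad \Psi_{ij} = \frac{1}{nh^d}\sum_{k=1}^{n} U_i\Big(\frac{X_k - x}{h}\Big)U_j\Big(\frac{X_k - x}{h}\Big)K\Big(\frac{X_k - x}{h}\Big),$$

where we recall that $U_0(v) = 1$ and $U_i(v) = v_i$, $i = 1, \ldots, d$, for any $v \in \mathbb{R}^d$.

To simplify notation and without loss of generality, we will assume, in what
follows, that the interesting indices are given by the first $d^*$ coordinates. Namely,
we will assume (but we did not use it to construct our procedures) that
$(i_1, \ldots, i_{d^*}) = (1, \ldots, d^*)$ and then $J = \{1, \ldots, d^*\}$.

We introduce some notation. The vector $\theta^*$ and the matrices $\Psi = A^tA$
and $A$ can be written as

$$\theta^* = \begin{pmatrix} \theta^*_{(1)} \\ \theta^*_{(2)} \end{pmatrix}, \qquad \Psi = \begin{pmatrix} \Psi_{11} & \Psi_{12} \\ \Psi_{21} & \Psi_{22} \end{pmatrix}, \qquad A = \begin{pmatrix} A_{(1)} & A_{(2)} \end{pmatrix},$$

where $\Psi_{11} \in M_{d^*+1}$, $\Psi_{12} \in M_{d^*+1,d-d^*}$, $\Psi_{21} \in M_{d-d^*,d^*+1}$, $\Psi_{22} \in M_{d-d^*}$,
$A_{(1)} \in M_{n,d^*+1}$, $A_{(2)} \in M_{n,d-d^*}$, $\theta^*_{(1)} \in \mathbb{R}^{d^*+1}$ and $\theta^*_{(2)} \in \mathbb{R}^{d-d^*}$. We remark
that, with the notational simplifications, $\theta^*_{(2)}$ is the null vector of $\mathbb{R}^{d-d^*}$.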
Lemma 2. On the event $\Omega_{01}$ the following statements hold:

• the LASSO selector exists and is unique,

• all the eigenvalues of $\Psi$, $\Psi_{11}$ and $\Psi_{22}$ belong to $[\mu_m/8, 6\mu_M]$.

Proof of Lemma 2. For the first point, we use the convexity of the function $\phi$
(introduced in equation (8)) to obtain the existence of a LASSO selector. By
Lemma 1, the difference of two LASSO selectors is in the kernel of $A$. On the event $\Omega_{01}$, the
kernel of $A$ is $\{0\}$. Thus, there is uniqueness of the LASSO on $\Omega_{01}$.

For the second point, we know that the eigenvalues of $\Psi$ are the squares of
the singular values of $A$. On the event $\Omega_{01}$, the singular values of $A$ belong to
$[(1/2)\sqrt{\mu_m/2}, \sqrt{6\mu_M}]$. This completes the proof for $\Psi$. Now, let $\lambda > 0$ be
an eigenvalue of $\Psi_{11}$ and $u_{(1)} \in \mathbb{R}^{d^*+1}$ be an eigenvector associated with $\lambda$. We
denote by $u_{(2)}$ the null vector of $\mathbb{R}^{d-d^*}$. We have

$$\lambda\|u_{(1)}\|_2^2 = \|A_{(1)}u_{(1)}\|_2^2 = \|A(u_{(1)}^t, u_{(2)}^t)^t\|_2^2 \le 6\mu_M\|(u_{(1)}^t, u_{(2)}^t)^t\|_2^2 = 6\mu_M\|u_{(1)}\|_2^2,$$

thus $\lambda \le 6\mu_M$. For the same reason, we have $\lambda \ge \mu_m/8$. The proof for $\Psi_{22}$
follows the same argument. □

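We consider the event

$$\Omega_{02} := \{ \forall j \in \{d^* + 1, \ldots, d\}, \forall k \in \{0, \ldots, d^*\} : |(\Psi_{21})_{jk}| \le 2hL_\mu M_K \}. \qquad (13)$$

Lemma 3. We assume that Assumption 2 and Assumption 4 hold. There exists
a constant $c_3$ depending only on $M_K$ and $\mu_M$ such that the following holds.
We have

$$\mathbb{P}[\Omega_{02}] \ge 1 - 2(d^* + 1)(d - d^*)\exp(-c_3 n h^{d+2}).$$

We take $h$ such that $0 < h < \mu_m[32(d^* + 1)L_\mu M_K]^{-1} \wedge \eta$. On the event $\Omega_{01} \cap \Omega_{02}$,
we have

$$\forall j = d^* + 1, \ldots, d, \quad \big| \big( \Psi_{21}(\Psi_{11})^{-1}\mathrm{sign}(\theta^*_{(1)}) \big)_j \big| \le 1/2.$$

Proof of Lemma 3. The first point is a direct application of Bernstein's inequality
and of the union bound. We use both assumptions of the lemma to upper
bound the expectation $|\mathbb{E}(\Psi_{21})_{jk}| \le hL_\mu M_K$.

For the second part of the lemma, let $j \in \{d^* + 1, \ldots, d\}$. On the event $\Omega_{01}$,
the maximal eigenvalue of $\Psi_{11}^{-1}$ is smaller than $8/\mu_m$, thus, we have

$$\big| \big( \Psi_{21}\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}) \big)_j \big| = \Big| \sum_{k=0}^{d^*} (\Psi_{21})_{jk}\big( \Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}) \big)_k \Big| \le \Big( \sum_{k=0}^{d^*} (\Psi_{21})_{jk}^2 \Big)^{1/2}\big\| \Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}) \big\|_2 \le \sqrt{d^* + 1}\,(2hL_\mu M_K)\,\frac{8}{\mu_m}\sqrt{d^* + 1} \le 1/2.$$

□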

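Remark. If we have $\mathbb{E}\Psi^{(n)} = I_{d+1}$, then we don't need any restriction on $h$.
Indeed, in this case, we can obtain that, with high probability, $\forall \theta \in \mathbb{R}^{d+1}$,
$(1 - \gamma)\|\theta\|_2 \le \|A\theta\|_2 \le (1 + \gamma)\|\theta\|_2$. Thus, with the same probability, we have

$$\forall \theta \in \mathbb{R}^{d+1}, \quad (1 - \gamma)\|\theta\|_2 \le \|\Psi\theta\|_2 \le (1 + \gamma)\|\theta\|_2.$$

By applying the last inequality to the vector $\theta_0 = \big( (\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}))^t, 0^t \big)^t$, we obtain:

$$d^* + 1 + \|\Psi_{21}\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)})\|_2^2 = \big\| \big( \mathrm{sign}(\theta^*_{(1)})^t, [\Psi_{21}\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)})]^t \big) \big\|_2^2
= \|\Psi\theta_0\|_2^2 \le (1 + \gamma)^2\|\theta_0\|_2^2 = (1 + \gamma)^2\|\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)})\|_2^2 \le (1 + \gamma)(d^* + 1).$$

Thus, we get $\|\Psi_{21}\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)})\|_2^2 \le \gamma(d^* + 1)$. Thus, for $\gamma$ small enough we
have $|(\Psi_{21}\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}))_j| \le 1/2$. The problem is that, in general, the dictionary
cannot satisfy $\mathbb{E}\Psi^{(n)} = I_{d+1}$. We just have

$$(\mu(x) - m_0h)I_{d+1} \le \mathbb{E}\Psi^{(n)} \le (\mu(x) + m_1h)I_{d+1},$$

with $m_0$ and $m_1$ two positive constants.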
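We consider the following events:

$$\Omega_0 := \Omega_{01} \cap \Omega_{02},$$

$$\Omega_1 := \big\{ \forall j = 0, \ldots, d^* : \big| (\Psi_{11}^{-1}W_{(1)})_j - \lambda(\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}))_j \big| < |(\theta^*_{(1)})_j| \big\}$$

and

$$\Omega_2 := \big\{ \forall j = d^* + 1, \ldots, d : \big| (\Psi_{21}\Psi_{11}^{-1}W_{(1)} - W_{(2)})_j \big| \le \lambda/2 \big\},$$

where $W_{(1)} = A_{(1)}^t\varepsilon \in \mathbb{R}^{d^*+1}$ and $W_{(2)} = A_{(2)}^t\varepsilon \in \mathbb{R}^{d-d^*}$. For notational simplicity,
the indices of the coordinates of any vector in $\mathbb{R}^{d^*+1}$ start from 0 and go to
$d^*$, and for any vector in $\mathbb{R}^{d-d^*}$ the indices start from $d^* + 1$ and go to $d$. We
remark that we work only on the event $\Omega_0$, on which the minimum eigenvalue
of $\Psi_{11}$ is strictly positive. Thus, on this event, $\Psi_{11}$ is regular and so $\Omega_0 \cap \Omega_1$
and $\Omega_0 \cap \Omega_2$ are well defined.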

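Proposition 2. Let $0 < h < \mu_m[32(d^* + 1)L_\mu M_K]^{-1} \wedge \eta$ and let Assumption 2 and
Assumption 4 hold. The event $\{\forall \bar\theta \in \mathbb{R}^{d+1}$ solution of (S), we have $\mathrm{sign}(\hat\theta) =
\mathrm{sign}(\hat\theta^*)\} \cap \Omega_0$ contains the event $\Omega_0 \cap \Omega_1 \cap \Omega_2$. We recall that for any $\theta \in \mathbb{R}^{d+1}$,
the vector $\hat\theta = (\theta_1, \ldots, \theta_d)$ is the vector made of the $d$ last coordinates of $\theta$.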

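Proof of Proposition 2. We consider the linear functional

$$F : \mathbb{R}^{d^*+1} \to \mathbb{R}^{d^*+1}, \qquad \theta \mapsto \theta - \theta^*_{(1)} - \Psi_{11}^{-1}W_{(1)} + \lambda a_{(1)},$$

where we denote by $a_{(1)}$ the vector $\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)})$. For any $x = (x_0, \ldots, x_{d^*})^t \in
\mathbb{R}^{d^*+1}$ and $r = (r_0, \ldots, r_{d^*}) \in (\mathbb{R}_+^*)^{d^*+1}$, we set $B(x, r) = \prod_{j=0}^{d^*} (x_j - r_j, x_j + r_j)$.
For any vector $v = (v_j)_j$, we set $|v| = (|v_j|)_j$. We have $F(B(\theta^*_{(1)}, |\theta^*_{(1)}|)) =
B(-\Psi_{11}^{-1}W_{(1)} + \lambda a_{(1)}, |\theta^*_{(1)}|)$. On the event $\Omega_1$, we have $0 \in B(-\Psi_{11}^{-1}W_{(1)} +
\lambda a_{(1)}, |\theta^*_{(1)}|)$. Hence, there exists $\bar\theta_{(1)} \in B(\theta^*_{(1)}, |\theta^*_{(1)}|)$ such that $F(\bar\theta_{(1)}) = 0$.
That is, $\bar\theta_{(1)} = \theta^*_{(1)} + \Psi_{11}^{-1}W_{(1)} - \lambda a_{(1)}$ and $|\bar\theta_{(1)} - \theta^*_{(1)}| < |\theta^*_{(1)}|$. Thus, we have
$\mathrm{sign}(\bar\theta_{(1)}) = \mathrm{sign}(\theta^*_{(1)})$ and so

$$\Psi_{11}(\bar\theta_{(1)} - \theta^*_{(1)}) - W_{(1)} = -\lambda\,\mathrm{sign}(\bar\theta_{(1)}). \qquad (14)$$

Using Lemma 3, on the event $\Omega_0$, we have for any $j = d^* + 1, \ldots, d$,
$|(\Psi_{21}\Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)}))_j| \le 1/2$. Thus, on the event $\Omega_0 \cap \Omega_2$, we have

$$-\lambda\mathbf{1}_{d-d^*} \le \Psi_{21}\Psi_{11}^{-1}W_{(1)} - \lambda\Psi_{21}a_{(1)} - W_{(2)} = \Psi_{21}(\bar\theta_{(1)} - \theta^*_{(1)}) - W_{(2)} \le \lambda\mathbf{1}_{d-d^*}, \qquad (15)$$

where the inequalities hold coordinate by coordinate and $\mathbf{1}_{d-d^*}$ is the
vector of $\mathbb{R}^{d-d^*}$ with all coordinates equal to 1.

We consider the vector $\bar\theta = (\bar\theta_{(1)}^t, \bar\theta_{(2)}^t)^t \in \mathbb{R}^{d+1}$ where $\bar\theta_{(2)} = 0_{d-d^*}$ is the null
vector of $\mathbb{R}^{d-d^*}$. We thus have $\mathrm{sign}(\bar\theta) = \mathrm{sign}(\theta^*)$ (because $\mathrm{sign}(\bar\theta_{(1)}) = \mathrm{sign}(\theta^*_{(1)})$
and $\bar\theta_{(2)} = 0_{d-d^*} = \theta^*_{(2)}$). Moreover, equations (14) and (15) are equivalent to
saying that $\bar\theta$ satisfies the system (S).

In particular, we prove that on the event $\Omega_0 \cap \Omega_1 \cap \Omega_2$, there exists $\bar\theta \in \mathbb{R}^{d+1}$
solution of (S) such that $\mathrm{sign}(\bar\theta) = \mathrm{sign}(\theta^*)$. We complete the proof with the
uniqueness of the LASSO on the event $\Omega_0$. □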

Proposition 3. There exist $c_0 > 0$ and $c_1 > 0$ depending only on $L_\mu$, $\mu_m$, $\mu_M$,
$M_K$, $L$, $C$ and $\sigma$ such that the following holds. Under the same assumptions as
in Theorem 1, we have

$$\mathbb{P}[\Omega_0 \cap \Omega_1 \cap \Omega_2] \ge 1 - c_1\exp(c_1 d - c_0 n h^{d+2}).$$

Proof of Proposition 3. We study the probability measure of f2 2 . We have 

n§cu2 =( ,. +1 {y >A/2-|&jl}, 

where & = (6 d .+i, . . . , b d f = G(Ai, . . . , A„)* and C = (Cd*+i, • • • , CO* = 
G(aiei,...,a n e n y with G = (gij)d*+i<i<d;i<j<n = ^i^u^'i) _ ^(2) tliat 
satisfy GG* = A\ 2) (I - A( 1) ^A t {1) )A( 2) . The matrix B = I - A^V^ A* (1) 
is symmetric and B 2 = B, then its eigenvalues are and 1. Moreover, ac- 
cording to Lemma 2, on the event fin, the eigenvalues of ^22 = ^(2)^(2) are 
smaller than 6/im- Since, GG* = A t ^ j BA( 2 ), the eigenvalues of GG* are smaller 
than 6/iM- For j € {cT + 1, ...,d}, this implies that Y^k=i9jk = (GG t )jj < 
sw PuefL d - d " : ||m||o=i ||GG*w|| 2 < and that £j is a zero-mean Gaussian vari- 

able with variance satisfying 

V |x(0) = o- 2^i* a * ^ ^ . ( 16 ) 

fe=l 

where V|x stands for the variance symbol conditionally to X = (X\, . . . , X n ). 
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Moreover, for any j G {d* + 1, . . . , d}, we have 

1/2, 



\b j \ = \j29 j kA k \<( J £s%) ||A|| 2 < V6^I|A|| 2 . 

fe=l k=l 

The $\ell_2$-norm of $\Delta$ can be upper bounded, with high probability, by using Bernstein's inequality. We have

$$\|\Delta\|_2^2 = \frac{1}{n}\sum_{i=1}^n V_i, \quad \text{where } V_i = \frac{1}{h^d}K\Big(\frac{X_i - x}{h}\Big)\big(f(X_i) - P_1 f(X_i, x)\big)^2,$$

and we use Assumption 5 to obtain $|V_i| \le h^{2-d}L^2 M_K$ and $\mathbb{V}(V_i) \le \mu_M h^{4-d}L^4 M_K^2$. Thus, there exists an event $\Omega_3$ of probability measure greater than $1 - \exp(-(3/8)\mu_M nh^d)$, on which $\|\Delta\|_2^2 \le \mathbb{E}\|\Delta\|_2^2 + \mu_M L^2 M_K h^2$. It is easy to see that $\mathbb{E}\|\Delta\|_2^2 \le \mu_M L^2 M_K h^2$. Thus, on the event $\Omega_3$, we have $\|\Delta\|_2^2 \le 2\mu_M L^2 M_K h^2$ and so $\max_{j=d^*+1,\ldots,d}|b_j| \le 2\sqrt{3M_K}\,\mu_M Lh$.

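We have $\lambda = 8\sqrt{3M_K}\,\mu_M Lh$ and, by using the classical upper bound on the tail of Gaussian random variables and (16), there exists a constant $c_0$ depending only on $\mu_M$, $M_K$, $L$, $\sigma$ such that

$$\begin{aligned}
P[\Omega_2^c \mid X \in \Omega_0 \cap \Omega_3] &\le \sum_{j=d^*+1}^{d} P\big[|\zeta_j| > \lambda/2 - |b_j| \,\big|\, X \in \Omega_0 \cap \Omega_3\big] \\
&\le \sum_{j=d^*+1}^{d} P\big[|\zeta_j| > \lambda/4 \,\big|\, X \in \Omega_0 \cap \Omega_3\big] \\
&\le (d - d^*)\,\frac{c_0}{\sqrt{nh^{d+2}}}\exp\Big(-\frac{nh^{d+2}}{c_0^2}\Big),
\end{aligned}$$

where $P[\,\cdot \mid X \in \Omega_0 \cap \Omega_3]$ is the probability conditionally to $X$ and to the event $\Omega_0 \cap \Omega_3$.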
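The "classical upper bound on the tail of Gaussian random variables" used here is, for $Z \sim N(0, s^2)$ and $t > 0$, $P[|Z| > t] \le \sqrt{2/\pi}\,(s/t)\exp(-t^2/(2s^2))$. The sketch below compares this bound with the exact tail and combines it with a union bound over coordinates. The function names and the numerical values of $n$, $h$, $d$ in the usage part are hypothetical, chosen only to show the scaling: a variance of order $1/(nh^d)$ and a threshold of order $h$ give a prefactor of order $1/\sqrt{nh^{d+2}}$ and an exponent of order $nh^{d+2}$.

```python
import math


def exact_two_sided_tail(t, sd):
    """Exact P[|Z| > t] for Z ~ N(0, sd^2)."""
    return math.erfc(t / (sd * math.sqrt(2.0)))


def classical_tail_bound(t, sd):
    """Upper bound sqrt(2/pi) * (sd/t) * exp(-t^2 / (2 sd^2)), valid for t > 0."""
    return math.sqrt(2.0 / math.pi) * (sd / t) * math.exp(-(t * t) / (2.0 * sd * sd))


def union_tail_bound(num_coords, t, sd):
    """Union bound over num_coords Gaussian coordinates sharing the same variance bound."""
    return min(1.0, num_coords * classical_tail_bound(t, sd))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    n, h, d = 10_000, 0.5, 2
    sd = 1.0 / math.sqrt(n * h ** d)   # standard deviation of order (n h^d)^{-1/2}
    t = h                              # threshold of order h
    print(exact_two_sided_tail(t, sd), classical_tail_bound(t, sd))
    print(union_tail_bound(50, t, sd))
```

The bound is only informative when $t$ is at least of the order of $s$; for small $t/s$ it exceeds 1, which is why the union bound is capped at 1 in the sketch.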

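Here, we study the probability measure of $\Omega_1$. We have

$$\Omega_1^c \subset \bigcup_{j=0}^{d^*} \{|\zeta_j| \ge |\theta_j^*| - \lambda|a_j| - |b_j|\},$$

where $a = (a_0, \ldots, a_{d^*})^t = \Psi_{11}^{-1}\mathrm{sign}(\theta^*_{(1)})$, $\zeta = (\zeta_0, \ldots, \zeta_{d^*})^t = \Psi_{11}^{-1}A_{(1)}^t(\alpha_1 e_1, \ldots, \alpha_n e_n)^t$ and $b = (b_0, \ldots, b_{d^*})^t = \Psi_{11}^{-1}A_{(1)}^t(\Delta_1, \ldots, \Delta_n)^t$.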

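The random variable $\zeta$ is of the form $H(\alpha_1 e_1, \ldots, \alpha_n e_n)^t$ with $H = (h_{ij})_{0 \le i \le d^*;\, 1 \le j \le n} = \Psi_{11}^{-1}A_{(1)}^t$, which satisfies $HH^t = \Psi_{11}^{-1}$. For $j = 0, \ldots, d^*$, the random variable $\zeta_j$ is a zero-mean Gaussian variable with variance (conditionally to $X$) $\sigma_j^2 = \sigma^2\sum_{k=1}^n h_{jk}^2\alpha_k^2$ satisfying, on the event $\Omega_0$,

$$\sigma_j^2 \le \frac{\sigma^2 M_K}{nh^d}\,(HH^t)_{jj} \le \frac{8\sigma^2 M_K}{\mu_m nh^d}.$$

The last inequality holds because, on the event $\Omega_0$, the maximum eigenvalue of $\Psi_{11}^{-1}$ is smaller than $8/\mu_m$.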
Moreover, using Lemma 2, on the event $\Omega_0$, we have, for any $j \in \{0, \ldots, d^*\}$,

$$|a_j| \le \|a\|_2 \le \frac{8}{\mu_m}\|\mathrm{sign}(\theta^*_{(1)})\|_2 \le \frac{8\sqrt{d^*+1}}{\mu_m}$$

and

$$|b_j| \le \Big(\sum_{k=1}^n h_{jk}^2\Big)^{1/2}\|\Delta\|_2 \le \frac{8}{\mu_m}\|\Delta\|_2.$$

We use the same argument as previously to obtain that, on the event $\Omega_3$, the $\ell_2$-norm of $\Delta$ satisfies $\|\Delta\|_2^2 \le 2\mu_M L^2 M_K h^2$, and so $\max_{j=0,\ldots,d^*}|b_j| \le (8/\mu_m)\sqrt{2\mu_M M_K}\,Lh$.

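Since $\lambda = 8\sqrt{3M_K}\,\mu_M Lh$, we have $\lambda|a_j| + |b_j| \le 36(\mu_M/\mu_m)LM_K\sqrt{d_0}\,h$. Thus, by using the classical upper bound on the tail of Gaussian random variables, the upper bound on the variance of the $\zeta_j$'s and Assumption 6 with $C > 72(\mu_M/\mu_m)LM_K\sqrt{d_0}$, there exists a constant $c_1 > 0$ depending only on $\mu_m$, $\mu_M$, $M_K$, $C$ and $\sigma$ such that

$$\begin{aligned}
P[\Omega_1^c \mid X \in \Omega_0 \cap \Omega_3] &\le \sum_{j=0}^{d^*} P\big[|\zeta_j| \ge |\theta_j^*| - \lambda|a_j| - |b_j| \,\big|\, X \in \Omega_0 \cap \Omega_3\big] \\
&\le \sum_{j=0}^{d^*} P\big[|\zeta_j| > Ch/2 \,\big|\, X \in \Omega_0 \cap \Omega_3\big] \\
&\le (d^*+1)\,\frac{c_1}{\sqrt{nh^{d+2}}}\exp\Big(-\frac{nh^{d+2}}{c_1^2}\Big).
\end{aligned}$$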

Thus, there exists a constant $c_2$ depending only on $\mu_m$, $\mu_M$, $L$, $C$ and $\sigma$ such that $P[\Omega_1^c \cup \Omega_2^c \mid X \in \Omega_0 \cap \Omega_3] \le d\,(c_2/\sqrt{nh^{d+2}})\exp(-nh^{d+2}/c_2^2)$. Finally, using the results of Proposition 1 and Lemma 3, we obtain an upper bound on the probability measure of the event $\Omega_0$. Combining this upper bound and the result on the probability measure of $\Omega_3$, there exist $c_3$ and $c_4$ depending only on $L_\mu$, $\mu_m$, $\mu_M$, $M_K$, $L$, $C$ and $\sigma$ such that

$$P[\Omega_0 \cap \Omega_1 \cap \Omega_2] \ge 1 - c_4\exp(c_4 d - nh^{d+2}c_3).$$

□

Theorem 1 follows by applying Proposition 2 and Proposition 3. For the study of $\bar\theta$ when $f(x) = 0$, we just need to "move" the first coordinate of $\theta^*$ to the end. Namely, we can use the same arguments as previously for the following model

$$Z = A\theta^* + \varepsilon, \quad \text{where } \theta^* = (\theta_1^*, \ldots, \theta_{d+1}^*)^t = (h\partial_1 f(x), \ldots, h\partial_d f(x), f(x))^t \qquad (17)$$

and the lines of the design matrix $A \in \mathcal{M}_{n,d+1}$ are $A_i = \alpha_i U\big((X_i - x)/h\big)$, $i = 1, \ldots, n$ (with $U(v) = (v_1, \ldots, v_d, 1)$). In models (7) and (17), all the null coordinates of $\theta^*$ and $\bar\theta$ are at the end of the vector and all the non-zero coordinates are at the beginning of the vector. Then, we just have to consider $A = (A_{(1)}, A_{(2)})$, with $A_{(1)} \in \mathcal{M}_{n,d^*}$ and $A_{(2)} \in \mathcal{M}_{n,d-d^*+1}$, and the matrices $\Psi = A^t A$ and $\Psi_{ij} = A_{(i)}^t A_{(j)}$, $i, j = 1, 2$. The proof, in this case, follows the lines of the previous proof, with this notation and some minor changes in the indices.
4.2. Proof of Theorem 2

Let $\delta > 0$. We have

$$\begin{aligned}
P\big[|\hat f(x) - f(x)| \ge \delta\big] &= P\big[|\hat f(x) - f(x)| \ge \delta \,\big|\, \hat J_2 = J\big]\,P[\hat J_2 = J] + P\big[|\hat f(x) - f(x)| \ge \delta \,\big|\, \hat J_2 \ne J\big]\,P[\hat J_2 \ne J] \\
&\le P\big[|\hat f(x) - f(x)| \ge \delta \,\big|\, \hat J_2 = J\big] + P[\hat J_2 \ne J]\,\mathbf{1}_{\delta \le (2f_{\max})^2} \\
&\le c_1\exp\big(-c_2 n^{2\beta/(2\beta+d^*)}\delta^2\big) + c_1\exp(c_1 d - c_0 nh^{d+2})\,\mathbf{1}_{\delta \le (2f_{\max})^2},
\end{aligned}$$

where, on the event $\{\hat J_2 = J\}$, we used the classical result on LPE (cf. Audibert and Tsybakov (2007)) and, for the event $\{\hat J_2 \ne J\}$, we upper bounded its probability measure by using Theorem 1. The assumption on $d$ completes the proof. □